
Qualcomm AI Engine Direct - heap profiling at runtime on target #19716

Open
jethroqti wants to merge 1 commit into pytorch:main from CodeLinaro:dev1/memory/profiling_rt

@jethroqti
Contributor

Qualcomm AI Engine Direct - heap profiling at runtime on target

Summary:
Heap profiling at runtime with HTP backend on Android platforms. DSP
heap profiling is available for QnnContext_createFromBinary use-cases.
It captures total DSP heap usage at two checkpoints:
- Before the first context is created (before_context_created)
- After the last context is freed (after_context_freed)

The difference between the two values represents heap consumed during
context execution. The value after freeing is typically equal to or
greater than before creation.

Test plan:
python backends/qualcomm/tests/test_qnn_delegate.py \
    TestQNNQuantizedUtils.test_qnn_backend_runtime_option_heap_profile \
    -b build-android -H ${HOST} -s ${SN} -m ${SOC_MODEL}

Note:
This test is expected to run on target device.

@pytorch-bot

pytorch-bot Bot commented May 21, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19716

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the "CLA Signed" label (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on May 21, 2026
@jethroqti
Contributor Author

@pytorchbot label "release notes: qualcomm"

@pytorch-bot pytorch-bot Bot added the "release notes: qualcomm" label (changes to the Qualcomm backend delegate) on May 21, 2026
@jethroqti
Contributor Author

@psiddh
I have already added a guard, based on #19224, for when the test runs on an x86 machine. Please take a look.
Sorry for the inconvenience.

@shewu-quic

@psiddh
Contributor

psiddh commented May 26, 2026

Can you rebase your PR please?

@psiddh
Contributor

psiddh commented May 26, 2026

@claude review this PR

@claude

claude Bot commented May 26, 2026

Claude finished @psiddh's task in 3m 29s


Code Review: Qualcomm AI Engine Direct - Heap Profiling at Runtime on Target

  • Read changed files and understand the implementation
  • Review runtime C++ changes
  • Review Python/test changes
  • Provide review feedback

Summary

This PR adds DSP heap profiling for the QNN HTP backend at runtime. It captures heap usage at two checkpoints (before first context creation, after last context freed) and writes them to a file. The implementation uses static state to coordinate across multiple contexts.


Issues & Suggestions

1. Race condition: WriteHeapProfile reads runtime option without lock

In QnnContextCommon.cpp:20-38, WriteHeapProfile() calls get_runtime_option(QNN_RUNTIME_HEAP_PROFILING_PATH, ...), which in turn calls the backend's get_option(QNN_BACKEND, ...). The backend's get_option holds runtime_option_mutex_, so the read itself is safe. However, WriteHeapProfile is called from both Configure() and ~QnnContext() after the mutex is released. If a concurrent set_option changes the path between the decision to profile and the WriteHeapProfile call, the profile is written to a different file than expected. This is likely low-risk in practice but worth documenting.
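
One way to make the target file deterministic is to read the path once, at the point where the profiling decision is made, and hand that snapshot to the writer. The Python sketch below only illustrates the idea and is not the PR's C++ code: RuntimeOptions, snapshot_profile_path, the "heap_profiling_path" key, and the in-memory sink are all hypothetical names chosen for the example.

```python
import threading


class RuntimeOptions:
    """Hypothetical stand-in for the backend's mutex-guarded option store."""

    def __init__(self):
        self._lock = threading.Lock()
        self._options = {}

    def set_option(self, key, value):
        with self._lock:
            self._options[key] = value

    def get_option(self, key, default=None):
        with self._lock:
            return self._options.get(key, default)


def snapshot_profile_path(options):
    """Read the path once; an empty or missing path means profiling is off."""
    path = options.get_option("heap_profiling_path")
    return path or None


def write_heap_profile(path, tag, value, sink):
    """Record the event against the snapshotted path (sink replaces file I/O)."""
    sink.append((path, tag, value))


options = RuntimeOptions()
options.set_option("heap_profiling_path", "first.txt")

# Decision point: capture the path together with the decision to profile.
path = snapshot_profile_path(options)

# Simulates a set_option call that lands before the write happens.
options.set_option("heap_profiling_path", "second.txt")

sink = []
if path is not None:
    write_heap_profile(path, "before_context_created", 4096, sink)
```

Because the writer never re-reads the option, a later set_option affects only subsequent profiling decisions, not a write that is already planned.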

2. Potential issue: Profile handle ownership and lifetime

In QnnContextCommon.cpp:55-56, the destructor passes qnn_profiler_->GetHandle() to qnn_context_free. Then in QnnProfile::~QnnProfile(), the profile handle is freed via qnn_profile_free. This ordering relies on QnnContext destroying qnn_profiler_ (unique_ptr) after the context free call — which is guaranteed by C++ destruction order (member destructors run after the destructor body). This is correct but subtle.

3. Static state not reset across program runs

QnnContextCommon.cpp:17-18:

std::mutex QnnContext::htp_context_mutex_;
int QnnContext::htp_context_count_{0};

The static htp_context_count_ is never explicitly reset. In normal operation this works: if the program creates and destroys contexts and then creates new ones, the count is back at 0 because the destructors decrement it. The risk is a context creation that fails after the count was incremented: without a rollback, the count would stay above 0 and the "first context" profiling would not trigger for subsequent contexts. The rollback at lines 120-126 handles this case correctly.
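
The counting scheme described here can be sketched in Python as a lock-guarded counter whose increment is undone when creation fails. This is an illustrative analogue under assumptions, not the PR's implementation: ContextCounter, create_context, and the events list are hypothetical names.

```python
import threading


class ContextCounter:
    """Hypothetical analogue of htp_context_mutex_ and htp_context_count_."""

    def __init__(self):
        self._lock = threading.Lock()
        self._count = 0

    @property
    def count(self):
        with self._lock:
            return self._count

    def acquire(self):
        """Increment the count; return True if this is the first live context."""
        with self._lock:
            self._count += 1
            return self._count == 1

    def release(self):
        """Decrement the count; return True if no live contexts remain."""
        with self._lock:
            self._count -= 1
            return self._count == 0


def create_context(counter, create_fn, events):
    """Create a context, recording the 'before' checkpoint for the first one."""
    if counter.acquire():
        events.append("before_context_created")
    try:
        return create_fn()
    except Exception:
        # Roll back so the next creation is treated as the first again.
        counter.release()
        raise


def failing_create():
    raise RuntimeError("context creation failed")
```

With the rollback in place, a failed creation leaves the count at 0, so the next successful creation records the "before" checkpoint again instead of being skipped.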

4. File truncation logic could lose data in edge cases

QnnProfiler.cpp:233-250 — The file open mode logic:

  • before_context_created → std::ios::trunc (truncates file)
  • after_context_freed → std::ios::app (appends)

This assumes events always arrive in order (before then after). If for any reason the QNN SDK returns events out-of-order, the file would contain incorrect data. Consider adding a check or comment that this ordering is guaranteed by the SDK.

Additionally, the file is opened and closed separately for each event write (lines 244-250). Opening with trunc for the first event and app for the second is reasonable, but opening the file once and writing both would be more efficient and atomic.
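
The truncate-then-append pattern, together with the suggested ordering check, might look like the following Python sketch. It assumes one "identifier,value" record per line, which is inferred from the test parser rather than confirmed by the PR, and write_heap_event is a hypothetical helper. Python's "w" and "a" modes play the roles of std::ios::trunc and std::ios::app.

```python
import os

BEFORE = "before_context_created"
AFTER = "after_context_freed"


def write_heap_event(path, tag, value):
    """Write one heap record, truncating on BEFORE and appending on AFTER."""
    if tag == BEFORE:
        mode = "w"  # start a fresh profile
    elif tag == AFTER:
        # Ordering check: only append when a BEFORE record opens the file.
        first_line = ""
        if os.path.exists(path):
            with open(path) as f:
                first_line = f.readline()
        if not first_line.startswith(BEFORE + ","):
            raise ValueError(f"{AFTER} arrived without a preceding {BEFORE}")
        mode = "a"
    else:
        raise ValueError(f"unknown heap profile event: {tag}")

    with open(path, mode) as f:
        f.write(f"{tag},{value}\n")
```

An out-of-order event then fails loudly instead of silently producing a file with a lone after_context_freed record.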

5. Missing runtime_option_mutex_ in set_option (pre-existing, but relevant)

Looking at QnnExecuTorchBackend.cpp:205, set_option does hold runtime_option_mutex_. Good — this is consistent with the newly added get_option lock at line 278.

6. Test validation could be more robust

backends/qualcomm/tests/utils.py:415-435 — The validate_heap_profile function:

values = [int(line.split(",")[1].strip()) for line in f if line.strip()]

This parsing is fragile. If the format ever changes (e.g., extra comma in the identifier), it would silently parse wrong values. Consider splitting on the last comma, or validating the identifier prefix matches expectations.

Also, difference is computed as after - before and printed as f"{difference:.2f} bytes", but difference is an int (since both values are parsed with int()). The .2f format specifier is misleading — this will always show .00.
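
A more defensive parser, sketched under assumptions: it splits each line on the last comma, checks that the identifiers are the two checkpoint names from the PR summary in the expected order, and keeps the difference as an integer. The function name parse_heap_profile and the exact on-disk identifiers are hypothetical; the real file format may differ.

```python
EXPECTED_TAGS = ("before_context_created", "after_context_freed")


def parse_heap_profile(lines):
    """Return (before, after, difference) in bytes from 'identifier,value' lines."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split on the LAST comma so commas inside the identifier are tolerated.
        identifier, _, raw_value = line.rpartition(",")
        records.append((identifier.strip(), int(raw_value.strip())))

    tags = tuple(tag for tag, _ in records)
    if tags != EXPECTED_TAGS:
        raise ValueError(f"unexpected heap profile records: {tags}")

    before, after = records[0][1], records[1][1]
    return before, after, after - before


before, after, difference = parse_heap_profile(
    ["before_context_created, 1000\n", "\n", "after_context_freed, 1024\n"]
)
# Integer values need no fractional format specifier.
summary = f"{difference} bytes"
```

Raising on an unexpected identifier sequence makes a format change surface as a test failure rather than as silently wrong numbers.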

7. BackendOptions<9> — magic number

qnn_executor_runner.cpp:223 — Bumping from <8> to <9> is correct for adding one more option, but this magic number will need updating each time a new option is added. A comment or constant would help maintainability, though this follows the existing pattern.

8. Multi-line comment in QnnContextCommon.cpp

Lines 85-95 contain a multi-line block comment explaining the heap profiling logic. Per the codebase style guidelines in CLAUDE.md, minimal comments are preferred and code should be self-documenting. However, since this documents a QNN SDK constraint that isn't obvious from the code, it's acceptable — but could be trimmed to 2-3 lines.


Minor Nits

  • backends/qualcomm/debugger/README.md:81 — Good fix for the trailing space in .bin .
  • QnnProfiler.cpp:27-28 — The #define macros (DEFINE_HEAP_BEFORE_CREATION, DEFINE_HEAP_AFTER_FREED) would be better as constexpr string literals in the anonymous namespace, matching modern C++ style. Defines leak into all translation units that include headers.
  • HtpContextCustomConfig.h:38-41 — The [[maybe_unused]] approach is pragmatic; the comment explaining it is helpful.

Overall Assessment

The implementation is well-structured with proper mutex protection for the static context counter, correct rollback on failure, and appropriate x86_64 test guards. The code follows existing patterns in the QNN backend. The main areas for improvement are the file I/O pattern in ProfileDataToFile (opening file twice) and the test parsing robustness.

Verdict: Looks good overall. The thread-safety model is sound, the feature is appropriately guarded behind profile level checks, and tests cover both quantized and floating-point paths. Minor improvements suggested above but nothing blocking.


@jethroqti
Contributor Author

jethroqti commented May 26, 2026

Can you rebase your PR please?

Sure.
@psiddh Finished the rebase.

@jethroqti jethroqti force-pushed the dev1/memory/profiling_rt branch from eff7a0f to 7a83ebe on May 26, 2026 13:38
@linux-foundation-easycla

CLA Not Signed

@jethroqti
Contributor Author

@pytorchbot label "release notes: qualcomm"
